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The stochastic simulation of biological systems is an increasingly pop- 
ular technique in bioinformatics. It often is an enlightening technique, 
which may however result in being computational expensive. We dis- 
cuss the main opportunities to speed it up on multi-core platforms, which 
pose new challenges for parallelisation techniques. These opportunities 
are developed in two general families of solutions involving both the sin- 
gle simulation and a bulk of independent simulations (either replicas of 
derived from parameter sweep) . Proposed solutions are tested on the par- 
allelisation of the CWC simulator (Calculus of Wrapped Compartments) 
that is carried out according to proposed solutions by way of the FastFlow 
programming framework making possible fast development and efficient 
execution on multi-cores. 

Keywords multi-core, parallel simulation, stochastic simulation, SIMD, 
lock-free synchronisation. 

1 Introduction 

Stochastic simulations are an increasingly popular technique to study biological 
systems. They - differently from other modelling approaches such as differential 
equations (ODEs) - are able to describe transient, and multi-stable behaviours 
of the systems. 

Different formalisms, based on automata models (like [4]), process algebras 
(like [25, 10]) or rewrite systems (like [26]) have cither been applied to, or 
inspired from, biological systems. Quantitative simulations of biological models 
represented with these kinds of frameworks (see, e.g. [25, 22, 14]) are usually 
developed via a stochastic method derived by Gillespie's algorithm [18]. 

Among other formalisms, the Calculus of Wrapped Compartments (CWC) 
[12] is a recently proposed rewriting-based language for the representation and 
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simulation of biological systems. It has been designed with the aim of sim- 
plifying the development of efficient implementations, while keeping the same 
expressiveness of other more complex languages. 

Stochastic simulations are computationally more expensive than ODEs nu- 
merical solution. This is particularly true for the kind of systems that are 
better represented by stochastic models since, for their uneven nature, should 
be simulated at a very fine grain to spot possible spikes of the modelled phe- 
nomena along time, or to discriminate families of possible behaviour that are 
not revealed by the averaged behaviour described by ODEs. 

The high computational cost of stochastic simulation is well known and has 
led, in the last two decades, to a number of attempts to accelerate them up 
using several kind of techniques, such as approximate simulation algorithms 
and parallel computing [17]. In this work, this latter approach is taken into 
account. 

Since stochastic simulations are basically Monte Carlo processes, many inde- 
pendent instances should be computed to achieve a statistically valid solution. 
These independent instances have been traditionally exploited in an embarrass- 
ingly parallel fashion, executing a partition of the instances in a different ma- 
chine. This approach naturally couples with distributed computing (i.e. cluster, 
grid, clouds). 

However, the entire hardware industry has been moving to multi-core, which 
nowadays equips the large majority of computing platforms. The rapid shift to- 
wards multi-core technology has many drivers that are likely to sustain this trend 
for several years to come. These platforms, which are increasingly diffused in sci- 
entific laboratories, typically offer moderate to high peak computational power. 
This potential power, however, cannot always be turned into actual application 
speed. This is particularly true for I/O- and memory-bound applications since 
all the cores usually share the same memory and I/O subsystem. 

The analysis of biological systems produces a large amount of data, often 
organised in streams coming from either analysis instruments or simulators. 
The management of these streams in not trivial on multi-core platforms as the 
memory bandwidth cannot usually sustain a continuous flux of data coming 
form all the cores at the same time. 

A related aspect regards analysis of the simulation results, which requires 
the merging of results from different simulation instances and possibly their 
statistical filtering or mining. In distributed computing, this phase is often de- 
moted to a secondary aspect in the computation and treated as with off-line 
post-processing tools. However, this approach is no longer realistic because of 
both 1) the ever-increasing size of produced data and, 2) it insists on the main 
weakness of multi-core platforms, i.e. memory bandwidth and core synchroni- 
sations. 

In this paper we propose a critical rethinking of the parallelisation of stochas- 
tic processes in the light of emerging multi-core platforms and the tools that 
are required to derive an efficient simulator from both performance and easy 
engineering viewpoints. We believe that this latter aspect is of crucial impor- 
tance for next generation biological tools because they will be largely designed 
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by bioinformatic scientists, who will be certainly much more interested in the 
accurate modelling of natural phenomena rather than on the synchronisation 
protocols required to build efficient tools on multi-core platforms. 

We use the CWC calculus and its sequential simulator (Sec. 2) as paradig- 
matic example to discuss the key features required to derive an easy porting on 
a multi-core platform (Sec. 3). In particular we will argument on the paralleli- 
sation of a single simulation instance (Sec. 3.1.1), many independent instances 
(Sec. 3.1.2), and the technical challenges they require. Among these, parallel 
programming tools and frameworks for multi-core are discussed in Sec. 4, in 
particular we will focus on stream oriented pattern-based parallel programming 
supported by the FastFlow framework (Sec. 4.1). 

The key features discussed in Sec. 2 are turned into a family of solutions to 
speed up both the single simulation instance and many independent instances. 
The former issue is approached using SIMD hardware accelerators (Sec. 5.1), 
the latter advocating a novel simulation schema based on FastFlow accelerator 
that guarantees both easy development and efficient execution (Sec. 5.2). The 
proposed solutions are experimentally evaluated. 

2 The Calculus of Wrapped Compartments 

The Calculus of labelled Wrapped Compartments (CWC) (a small variant of 
the one presented in [12]) is based on a nested structure of ambients delimited 
by membranes with specific proprieties. Biological entities like cells, bacteria 
and their interactions can be easily described in CWC. 

2.1 CWC 

Let A be a set of atomic elements (atoms for short), ranged over by a, b, 
and C a set of compartment types represented as labels ranged over by £, . . .. 

A term of CWC is a multiset t of simple terms where a simple term is 
either an atom a or a compartment (a J t') e consisting of a wrap (represented 
by the multiset of atoms o), a content (represented by the term t') and a type 
(represented by the label £). 

An example of term is a b (c d\e f) 1 representing a multiset (multisets are 
denoted by listing the elements separated by a space) consisting of two atoms 
a and b (for instance two molecules) and an £-type compartment (c d\ e f) 1 
which, in turn, consists of a wrap (a membrane) with two atoms c and d (for 
instance, two proteins) on its surface, and containing the atoms e (for instance, 
a molecule) and / (for instance a DNA strand). 

System transformations are defined by rewriting rules. A rewriting rule is 
defined as a pair of terms (on an extended set of atomic elements which includes 
variables), which represent the patterns, ranged over by P, O, together with a 
label £ representing the compartment type to which the rule can be applied. 
Rules are represented as expression of the form £ : P M> O. A simple example 
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of a rewrite rule is 

t: abX ^ cX 

meaning that in all compartments of type £ an occurrence of a, b (X can match 
with all the remaining part of the compartment content) can be replaced by c. 

The application of a rule t : P ^ O to a term t is performed in the following 
way: 

• find (if it exists) the content (or the wrap) u of a compartment of type I 
in t and an substitution a of variables by terms such that u = cr(P). 

• replace in t the subterm u with a(0). 

We write 1 1-> f if t' is obtained by applying a rewrite rule to t. 

2.2 Stochastic Simulation 

A stochastic simulation model for biological systems can be defined by incorpo- 
rating a collision-based stochastic framework along the line of the one presented 
by Gillespie in [18], which is, de facto, the standard way to model quantitative 
aspects of biological systems. The idea of Gillespie's algorithm is that a rate 
constant is associated with each considered chemical reaction. Such a constant 
is obtained by multiplying the kinetic constant of the reaction by the number 
of possible combinations of reactants that may occur in the system (thus mod- 
elling the law of mass action, but more flexible approaches are also considered 
in the literature [12]). The resulting rate is then used as the parameter of an 
exponential distribution modelling the time spent between two occurrences of 
the considered chemical reaction. 

Each reduction rule is enriched by the kinetic constant k of the reaction 

k 

that it represents (notation I : P i — > O). The number of reactants in a reaction 
represented by a rewrite rule is evaluated considering the number of distinct 
occurrences, in the same context, of subterms matching with the considered 
rule. For instance in evaluating the application rate of the stochastic rewrite 

k — 

rule R — I : a b X i — > c X to the term t = a a b b in a compartment of 
type I we must consider the number of the possible combinations of reactants 
of the form a bint. Since each occurrence of a can react with each occurrence 
of b, this number is 4. So the application rate of R is k ■ 4. This number can be 
evaluated by specific algorithms (we refer to[12] for a more detailed account). 
The stochastic simulation algorithm is essentially a Continuous Time Markov 
Chain (CTMC). Given a term t, a set 1Z of reduction rules, a global time S and 
all the reductions e\, . . . , eju applicable to t, with rates n, . . . ,Tm such that 
r = r ii the standard simulation procedure that corresponds to Gillespie's 

simulation algorithm [18] consists of the following two steps: 

1. The time 5 + r at which the next stochastic reduction will occur is ran- 
domly chosen with r exponentially distributed with parameter r; 
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2. The reduction e% that will occur at time S + t is randomly chosen with 
probability r^/r. 

2.3 The CWC simulator 

The CWC simulator is a tool strictly based on Gillespie algorithm [29]. It 
iterates the following three logical steps: 

1. Match: it searches for the occurrences of the rules (object trees) inside the 
term (subject tree), according to the notion of contexts. Then it associates 
a stochastic rate to each match. This step results into a weighted matchset. 

2. Resolve (Monte Carlo step): it stochastically decides the time (offset) at 
which the next reaction will occur and the rule that will activate it. More- 
over, since in CWC reactions can occur at different contexts, it consults 
the matchset in order to decide how portion of the system will react. 

3. Update: if effectively applied the selected reaction, affecting both the sys- 
tem and the clock, moving forward the simulation process. 

3 Exploiting Parallelism in Stochastic Simula- 
tions 

Gillespie algorithm realises a Monte Carlo type simulation method, thus it relies 
on repeated random sampling to compute the result. An individual simulation, 
which tracks the state of the system at each time-step, is called a trajectory. 
Each individual trajectory represents just one possible way in which the system 
might have reacted over the time-span from the start-time to the stop-time. 
Many thousands of trajectories might be needed to get a representative picture 
of how the system behaves on the whole. An example of the synthesis of many 
trajectories obtained from the CWC simulator is reported in Fig. 1. In the figure, 
the two curves are obtained by averaging 100 trajectories from independent 
simulation instances at fixed simulation time steps and computing variance with 
90% confidence intervals over them. 

For this, stochastic simulations are computationally more expensive than 
ODEs numerical unfolding. This balance is well-known and it motivated many 
attempts to speed up their execution time along last two decades [17]. They can 
be roughly categorised in attempts that tackle the speeding up of a single simula- 
tion and a bulk of independent simulations. In the following these (not mutually 
exclusive) approaches are discussed under the viewpoint of parallel computing 
techniques and their exploitation on commodity multi-core platforms. This dis- 
cussion is not intended to be an encyclopaedic review of other techniques that 
can be used to achieve the same aim, such as ones related to the approximation 
of the simulation results, such as T-leaping and hybrid techniques [4]. 
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Figure 1: Output of the CWC simulator for gene regulation in E. Coli model: 
average of 100 independent instances with variance (90% confidence) computed 
a fixed simulation time steps. 



3.1 What can speeded up? Where parallelism can be 
found? 

3.1.1 Speeding up a single simulation 

Parallelising a single Gillespie-like stochastic simulation, i.e. the derivation of 
a simulation trajectory, is intrinsically hard. Unless introducing algorithmic re- 
laxations - which correctness should be proved and typically lead to algorithm 
approximations - two successive Monte Carlo steps cannot be concurrently ex- 
ecuted since there exists a strict data dependency between the two steps. Also, 
at the single step grain, speculative execution is unfeasible because of the ex- 
cessive branching of possible future execution paths. 1 As result, the only viable 
option to exploit parallel computing within a single simulation consists in par- 
allelising the single Monte Carlo step. Here, the available concurrency could be 
determined via data dependency analysis [9] that can be made for any given 
specific simulator code (see Sec. 5). Typically, parallelism exploited at this level 
is extremely fine-grained since the longest concurrent execution path may at 
most count dozens to hundreds machine instructions. 

In this range, currently, no software mechanisms can support an effective 
inter-core or multi-processor parallelisation: the overhead will easily overcome 
any benefit; the only viable option is hardware parallelism within a single core. 
Since, typically, instruction stream parallelism is already exploited by super- 
scalar processor architecture, the only additional parallelisation opportunity has 

1 Even if also this kind of approach has been attempted and it might result feasible whether 
coupled with algorithm relaxations [17]. 
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to be searched in data parallelism to be exploited via a hardware accelerator, 
such as internal SSE/MMX or external GPGPU accelerators (General-Purpose 
GPU) . In both cases, the simulator code should be deeply re-designed in a con- 
tiguous sequence of SIMD instructions. As we shall see in Sec. 5, this generally 
may lead to very modest advantages with respect to the required effort. 

3.1.2 Speeding up independent simulation instances 

The intrinsic complexity in the parallelisation of the single step has tradition- 
ally led to the exploitation of parallelism in the computation of independent 
instances of the same simulation, which should anyway be computed to achieve 
statistical convergence of simulated trajectories (as in all Monte Carlo methods). 
The problem is well understood; it has been exploited in the last two decades in 
many different flavours and distributed computing environments, from clusters 
to grid to clouds. Notwithstanding that the problem has been often approached 
in a simplified form, often assuming that output data has a negligible size, as 
it happens in Monte Carlo Pi computation; this is not likely to happen in this 
and next generation biological simulations. 

In particular, simulation distribution, result gathering, trajectory data as- 
sembling and analysis phases arc neither considered as a part of the problems 
to be accelerated nor considered in the performance evaluation. As matter of a 
fact, parallel simulations is often considered an "embarrassingly parallel" prob- 
lem, whereas it is - if and only if - data distribution, gathering, filtering, and 
analysis are not considered as part of the whole process. Unfortunately, it hap- 
pens that they may result as expensive as the simulation itself. As an example, 
a simulation of the HIV diffusion problem (computed using the StochKit toolkit 
for 4 years of simulation time) produces about 5 GBytes of data per instance 
[1]. As clear, the data size is n- folded when n instances are considered. Dur- 
ing post-processing phase, this data should be gathered and often reduced to a 
single trajectory via statistical methods. 

These potential performance flaws are further exacerbated in multi-core and 
many-core platforms. These platforms do not exhibit the same degree of repli- 
cation of hardware resources that can be found in distributed environments, and 
even independent processes actually compete for the same hardware resources 
within the single platform, such main and secondary memory, which perfor- 
mances represent the real challenge of forthcoming parallel programming models 
(a.k.a. memory wall problem). While simulation is substantially a CPU-bound 
problem on distributed platform, it may become prevalently an I/O-bound prob- 
lem on a multi-core platform due to the need to store and post-process many 
trajectories. In particular, multi-stable simulations may require very fine grain 
resolution to discriminate trajectory state changes, and as it is clear, the finer 
the observed simulation time-step the strongest the computational problem is 
characterised as I/O-bound. 
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3.2 How to parallelise? A list of guidelines for the effective 
parallelisation on multi-core 

In the previous section we discussed where parallelism can be found in Gillespie- 
like algorithms; the question that naturally follows is how this parallelism can be 
effectively exploited. We advocate here a number of parallelisation issues that, 
we believe, can be used as pragmatic "guidelines" for the efficient parallelisation 
of this kind of algorithms on multi-core. Observe that, in principle, they are 
quite independent of the source of parallelism; however, they focus on inter-core 
parallelism, thus cannot be expected to be applied to other kinds of parallelism 
(e.g. SIMD parallelism). They will be then used along Sec. 5 as "instruments" to 
evaluate the quality of the parallelisation work for the execution of independent 
instances of the CWC simulator. 

3.2.1 Data stream as a first-class concept 

The in silico (as well as in vitro) analysis of biological systems produces a huge 
amount of data. Often, they can be conveniently represented as data streams 
since they sequentially flows out from one or more (hardware or software) de- 
vices; often the cost of full storage of these streams overcomes their utility, as in 
many cases only a statistical filtering of the data is needed. These data streams 
can be conveniently represented as first-class concept] their management should 
be performed on-line by exploiting the potentiality of underlying multi-core 
platforms via a suitable high-level programming tools. 

3.2.2 Effective, high-level programming tools 

To date, parallel programming has not embraced much more than low-level com- 
munication and synchronisation libraries. In the hierarchy of abstractions, it is 
only slightly above toggling absolute binary in the front panel of the machine. 
We believe that, among many, one of the reasons for such failure is the fact 
that programming multi-core is still perceived as a branch of high-performance 
computing with the consequent excessive focus on absolute performance mea- 
sures. By definition, the raison d'etre for high-performance computing is high 
performance, but MIPS, FLOPS and speedup need not be the only measure. 
Human productivity, total cost and time to solution are equally, if not more, 
important. The shift to multi-core is required to be graceful in the short term: 
existing applications should be ported to multi-core systems with moderate ef- 
fort. This is particularly important when parallel computing serves as tools for 
other sciences since non expert designer should be able to experiment differ- 
ent algorithmic solutions fro both simulations and data analysis. This latter 
point, in particular, may require data synchronisation and could represent a 
very critical design point for both correctness and performance. 
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3.2.3 Cache-friendly synchronisation for data streams 

Current commodity multi-core and many-core platforms exhibit a cache-coherent 
shared memory since it makes it can effectively reduce the programming com- 
plexity of parallel programs (whereas different architectures, such as IBM Cell, 
have exhibited their major limits in programming complexity). Cache coherency 
is not for free, however. It largely affects synchronisations cost and may require 
expensive performance tuning. This is both an opportunity and a challenge 
for parallel programming framework designers since a properly designed frame- 
work should support the application with easy exploitation of parallelism (either 
design from scratch or porting from sequential code) and high-performance. 

3.2.4 Load balancing of irregular workloads 

Stochastic processes exhibit an irregular behaviour in space and time by their 
very nature since different simulations may cover the same simulation timespan 
following many different, randomly chosen, paths and number of iterations. 
Therefore, parallelisation tools should support the dynamic and active balancing 
of workload across the involved cores. 

4 Pattern-based high-level stream parallelism 

Stream parallelism is a programming paradigm supporting the parallel execu- 
tion of a stream of tasks by using a series of sequential or parallel stages. A 
stream program can be naturally represented as a graph of independent stages 
(kernels or filters) that communicate explicitly over data channels. Conceptu- 
ally, a streaming computation represents a sequence of transformations on the 
data streams in the program. Each stage of the graph reads one or more tasks 
from the input stream, applies some computation, and writes one or more out- 
put tasks to the output stream. Parallelism is achieved by running each stage 
of the graph simultaneously on subsequent or independent data elements. 

As with all kinds of parallel program, stream programs can be expressed 
as a graph of concurrent activities, and directly programmed using a low-level 
shared memory or message passing programming framework. Although this is 
still a common approach, writing a correct, efficient and portable program in 
this way is a non-trivial activity. Attempts to reduce the programming effort by 
raising the level of abstraction through the provision of parallel programming 
frameworks date back at least three decades and have resulted in a number of sig- 
nificant contributions. Notable among these is the skeletal approach [If] (a.k.a. 
pattern-based parallel programming), which appears to be becoming increas- 
ingly popular after being revamped by several successful parallel programming 
frameworks [13, 21]. 

Skeletons (a.k.a. patterns) capture common parallel programming paradigms 
(e.g. MapReduce, ForAll, Divide&Conquer, etc.) and make them available 
to the programmer as high-level programming constructs equipped with well- 
defined functional and extra- functional semantics [2]. 
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The pipeline skeleton is one of the most widely-known, although sometimes it 
is underestimated. Parallelism is achieved by running each stage simultaneously 
on subsequent data elements, with the pipeline's throughput being limited by 
the throughput of the slowest stage. 

The farm skeleton models functional replication and consists in running 
multiple independent stages in parallel, each operating on different tasks of the 
input stream. The farm skeleton is typically used to improve the throughput 
of slow stages of a pipeline. It can be better understood as a three stage - 
emitter, workers, collector - pipeline. The emitter dispatches stream items to 
a set of workers, which independently elaborate different items. The output of 
the workers is then gathered by the collector into a single stream. These logical 
stages are considered by a consolidated literature as the basic building blocks 
of stream programming. 

The loop skeleton (also known as feedback), provides a way to generate cycles 
in a stream graph. This skeleton is typically used together with the farm skeleton 
to model recursive and Divide&Conquer computations. 

In particular, the FastFlow implementation of the loop and farm patterns 
will be exploited in Sec. 5 to parallelise the CWC simulator. 

4.1 The FastFlow skeleton-based programming framework 

FastFlow is a C++ parallel programming framework aimed at simplifying the 
development of efficient applications for multi-core platforms. The key vision 
of FastFlow is that ease-of-development and runtime efficiency can both be 
achieved by raising the abstraction level of the design phase, thus providing 
developers with a suitable set of parallel programming patterns that can be 
efficiently compiled onto the target platforms [28]. 

FastFlow is conceptually designed as a stack of layers that progressively ab- 
stract the shared memory parallelism at the level of cores up to the definition of 
useful programming constructs supporting structured parallel programming on 
cache-coherent shared memory multi- and many-core architectures (see Fig. 2). 
These architectures include commodity, homogeneous, multi-core systems such 
as Intel core, AMD K10, etc. FastFlow natively supports stream parallelism 
since it implements parallelism patterns as data-flow graphs - so-called stream- 
ing networks. 

The core of the FastFlow framework (i.e. run-time support tier) provides 
an efficient implementation of Single-Producer-Single-Consumer (SPSC) FIFO 
queues. FastFlow SPSC queues are lock-free, wait-free, and do not use inter- 
locked operations [28, 19]. 

The SPSC queue is primarily used as synchronisation mechanism for memory 
pointers in a consumer-producer fashion. The next tier up extends one-to-one 
queues (SPSC) to one-to-many (SPMC), many-to-one (MPSC), and many-to- 
many (MPMC) synchronisations and data flows, which are implemented using 
only SPSC queues and arbiter threads, thus providing lock-free and wait-free 
arbitrary data-flow graphs (arbitrary streaming networks) that requires few or 
no memory barriers, and thus few cache invalidations. 
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Figure 2: FastFlow layered architecture with abstraction examples at the dif- 
ferent layers of the stack. 



The upper layer, i.e. high-level programming, provides a programming frame- 
work based on parallel patterns. In particular, FastFlow provides farm, farm- 
with- feedback (i.e. Divide&Conquer) and pipeline patterns, and supports their 
arbitrary nesting and composition. The FastFlow pattern set can be further 
extended by building new C++ templates. 

FastFlow is available as an open source software under LGPLv3 [28]. A 
performance comparison against other programming tools such as POSIX, Cilk, 
OpenMP, and Intel TBB has been reported in [28, 3]. 

5 The CWC Simulator Testbed 

The proposed guidelines are validated using the CWC simulator as running 
example. It has been developed as a plain C++ sequential code (exploiting the 
CH — h boost library), then it has been parallelised for multi-core. In order to 
evaluate the effectiveness of the methodology also in term of development effort, 
the original code and its parallel versions have been developed by two different 
(master student) teams. In the parallelisation two main frameworks have been 
used: the GCC compiler SSE intrinsics [20] to speed up a single simulation, and 
the FastFlow parallel programming framework [28] to speed up independent 
simulation instances, which provides the basic facilities described in Sec. 3.1.2 
and that is briefly recapped in Sec. 4.1. 

All reported experiments have been executed on an Intel workstation with 
2 quad-core Xeon E5520 Nehalem (16 HyperThreads) @2.26GHz with 8MB L3 
cache and 24 GBytes of main memory with Linux x86_64. The Nehalem pro- 
cessor uses Simultaneous MultiThreading (SMT, a.k.a. HyperThreading) with 
2 contexts per core and the Quickpath interconnect equipped with a distributed 
cache coherency protocol. SMT technology makes a single physical processor 
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appear as two logical processors for the operating system, but all execution 
resources are shared between the two contexts. Each core is equipped with a 
SSE4.2 SIMD engine. 

5.1 Speeding up a single simulation 

As discussed in Sec. 3.1.1, the parallelisation of the single CWC simulation 
step is theoretically feasible via the SSE accelerator. The pseudo-code of the 
simulation step is sketched in Fig. 3. In the figure, the phases of the code that 
can be parallelised in SIMD fashion with moderate effort are marked with the 
"SIMD" label. The exploited parallelism degree is 4 since 4x32-bit operation 
has been used; Fig. 4 reports the achieved speedup on a single core for n- 
species of the Lotka-Volterra models (the 2-species case is the standard prey- 
predator model). Despite SSE exhibits very low overhead, the achieved speedup 
is almost negligible because only a fraction of the whole simulation step has been 
actually parallelised (Amdahl's law's applies [5]). Similar parallelisation efforts 
conducted on GPGPU accelerators, which exploit a much larger potential SIMD 
parallelism, do not actually result in satisfactory results. As an example, see 
the parallelisation of Gillespics first reaction method on NVIDIA CUDA [16]. 

Unfortunately, the extension of the SIMD parallelism to larger fractions of 
the code may require a very high coding effort since the redesign of the original 
code is required. As an example recursive patterns (used for tree-matching, 
marked with "non-SIMD" parallelism in Fig. 3) are not easily manageable us- 
ing SIMD parallelism and should be differently coded before being parallelised. 
Observe that these recursive kernels cannot either be parallelised across cores 
because they are excessively fine-grained; as an example the parallelisation via 
POSIX threads (tested with FastFlow and Intel TBB) is, in our the reference 
platform, from 10 to 100 times slower with respect to sequential version due to 
synchronisation overheads (i.e. cache coherence, cache misses, etc.). 

All in all, intra-core SIMD parallelism appears the only viable way to this 
kind of parallelisation. Observe however that if it might require, for this class 
of algorithms, a coding effort that easily overcomes the potential benefits. 

5.2 Speeding up independent simulation instances 

Starting from the CWC sequential simulator code sketched in Fig. 3, we here 
advocate a parallelisation schema supporting the parallel execution of many self- 
balancing simulation instances on multi-core. Its design aims to address all the 
issues discussed in Sec. 3.1.2: it is realised by means of the FastFlow framework 
(see Sec. 4.1) that natively supports high-level parallel programming patterns 
working on data streams and it exhibits an efficient lock-free run-time support 
that can be integrated with SIMD code. It therefore makes it possible the easy 
porting of the sequential CWC code on multi-core for the execution of multiple 
simulation instances (either replicas or the parameter sweeping of a simulation), 
and the on-line synthesis of their trajectories, which can be made according to 
one or more associative reduction functions, e.g. average, variance, confidence. 



12 



SimulatiomStep { 
// 1. Match 

foreach r G ruleset { 
Match(r, T, TOPJLEVEL); // [non-SIMD parallelism] 
// 2. Resolve (Monte Carlo) 

(tau, mu) = Gillespie(matchset); 

context = stochastic choice on matchset [mu] ; 
// 3. Update 

(P,0) = left_and_right_side (mu); 

delete P_sigma from T at context; // SIMD 

put deletecLelements in sigma; 

add CLsigma to T at context; // SIMD 

simclock += tau; 

} 

Match(rule, term, context) { 
P = left_hand_side(rule) ; 

count_population = Match_Populations(atoms(P), atoms(term)); // SIMD 
count_compartments = match compartments (term) against compartments(P); 
stoch_rate = count_population * count_compartments * kinetic(rule); 
put (context, stoch_rate) in matchset [rule ] ; 
foreach c_t G compartments(term) 

Match(rule, content(c_t) , c_t); // recursive step [non— SIMD parallelism] 

} 

Match_Populations(atoms_object, atoms_subject) { 
count = 1; 

foreach (x, k) G atoms_object { 
find (x, n) in atoms_subject; 
count *= binomial(n, k); 

} 

return count; 

} 

Figure 3: CWC simulator pseudo-code (see also Sec. 2.3) with possible sources 
of fine-grain parallelism. 
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SIMD (S) 
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Ideal speedup 


2 
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70.743 


70.043 


1.01 
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16 


284.276 


278.701 


1.02 
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32 


1121.231 


1099.245 


1.02 


4 



Figure 4: Execution time (S) and speedup of the SIMD CWC simulator against 
the sequential version on the n-species Lotka- Volterra. 
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Figure 5: Three alternative parallelisation schemas exemplified on 6 simulation 
instances and 3 processors, i) Round-robin execution of simulations followed 
by a reduction phase, ii) Auto-balancing schema with time-slicing at constant 
simulation time (variable wall-clock time) followed by a reduction phase, iii) 
Previous schema with on-line pipelined reduction. 



14 



The schema supports three main behaviours, which are exemplified in Fig. 5: 

i) The different simulation instances (called a,b,c,d,e,f) are dispatched for the ex- 
ecution on different workers threads of a FastFlow farm, which run on different 
cores; a worker sequentially runs all the simulations it received. The dispatching 
of instances to workers could be either performed before the execution accord- 
ing to some static policy (e.g. Round-Robin) or via an on-line scheduling policy 
(e.g. on-demand). Workers stream out the trajectories, which are sampled along 
fixed steps along simulation time. Streams are buffered in the farm collector 
and then reduced in a single stream according to one or more functions (e.g. F). 
Observe that the constant sampling assumption simplify the reduction process 
even if it is not strictly required since data could be on-line re-aligned during 
the buffering [1]. Also notice that since simulation time advances according to a 
random variable, different instances advance at different wall-clock time rates. 
The phenomenon is highlighted in Fig. 5 i) splitting each instance in four equal 
fractions of the simulation time (e.g. (al, a2, a3, a4), (bl, b2, b3, b4), which 
exhibit different wall-clock time to be computed (segment length). This may 
induce even a significant load unbalance that could be only partially addressed 
using on-line scheduling policies. 

ii) A possible solution to improve load balancing of the schema consists in cou- 
pling the on-line scheduling policy with the reduction of execution time-slice 
that is subject to the scheduling policy. At this end, each simulation instance 
can be represented as an object that incorporate its current progress and provide 
the scheduler with the possibility of stopping and restarting an instance. In this 
way, as it happens in a time-sharing operating system, (fixed or variable length) 
slices of an instance can be scheduled on different workers provided slices of the 
same instances are sequenced (possibly on different workers). Thanks to cache- 
coherent shared memory the scheduling can be efficiently realised via pointer 
management. The idea is exemplified in Fig. 5 ii). Also, scheduling and dis- 
patching to workers can be equipped with predictive heuristics based on instance 
history in order to characterise the relative speed of the simulation instances. 

Hi) The previous schema can be further improved by pipelining the reduction 
phase that is performed on-line. Since instance time-slicing can make all the 
instances to progress, a running window of all the trajectories can be reduced 
while they are still being produced. The reduction process, which is logically 
made within a separate thread (i.e. the farm collector), can be either run on an 
additional processor or interleaved with the execution of simulation instances 
(see Fig. 5 Hi). The solution also significantly reduces the amount of data to be 
kept in memory because: 1) thanks to interleaving all the trajectories advances 
almost aligned with respect to simulation time; 2) the already reduced parts 
of the trajectories can be deleted from main memory (and stored in secondary 
memory if needed) . 

The three schemas can be effectively implemented using FastFlow as sketched 



15 




F1(a,...f) 
F2(a,...f) 



Figure 6: Architecture of the FastFlow-based CWC parallel simulator. 



in Fig. 6. In particular, the FastFlow farm accelerator feature [28] fits well the 
previous design since it makes possible to offload 2 a stream of object pointers 
onto a farm of workers, each of them running a CWC simulator, and to im- 
plement user-defined dispatching and reduction functions via standard Object 
Oriented subclassing. As discussed in Sec. 4.1, FastFlow natively provides the 
programmer with streams, a configurable farm pattern, and an efficient run- 
time support based on lock-free synchronisations. All these features effectively 
made it possible to port the CWC sequential simulator to multi-core with mod- 
erate effort (few days for our team of students). In addition, the complexity 
of the achieved solution can be gracefully improved by successive refinements 
in order to test different scheduling policies or variants to the basic schema. 
In this regard the accelerator feature represents a key issue since it enables 
the programmer to make very local changes to the original code that in first 
approximation consists in changing a method call into the offload of the same 
method. 

Figure 7 reports the achieved speedup and scalability for schema Hi), evalu- 
ated on multiple instances of simulation over the Lotka-Volterra model. In the 
speedup plot, the measures for the parallel version includes the time spent for 
computing reductions (means, variance, and confidence), whereas this cost is not 
measured in the sequential version as it is managed as post-processing phase; 
nevertheless, the achieved speedup is reasonably good. The quality of paral- 
lelisation work is further confirmed by the scalability plot, where the parallel 
version is compared with a sequential code that computes the same reductions 
(means, variance, and confidence). In this case the achieved scalability is close 
to the ideal one. Observe that, as typical in memory-bound CPU-intensive 
workloads, HyperThreading does not bring any additional benefit. 

2 So-called self-offload since the accelerator use the same hardware device of the main 
thread, i.e. multi-core CPU, see Fig. 2 [28]. 
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Figure 7: Speedup (T por ( n ) vs T seg ) and scalability (T por („) vs T par ( 1) ) of the 
parallel CWC simulator, see Sec. 5.2 mj. 
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6 Related Works 



The parallelisation of stochastic simulators has been extensively studied in the 
last two decades. Many of these efforts focus on distributed architecture and 
specific simulators. Our work differs from these efforts in three main aspects: 
1) it mainly address multicore-specific parallelisation issues; 2) it advocates a 
general parallelisation schema rather than a specific simulator, 3) it specifically 
address the on-line reduction of simulation trajectories, thus it is designed to 
manage large streams of data. To the best of our knowledge, many related 
works covers some of these aspects, but very few of them (if any) address all 
three aspects. Among related works, some are worth to be explicitly mentioned. 

The Swarm algorithm [27], which is well suited for biochemical pathway 
optimisation has been used in a distributed environment, e.g., in Grid Cell- 
ware [15], a grid-based modelling and simulation tool for the analysis of biolog- 
ical pathways that offers an integrated environment for several mathematical 
representations ranging from stochastic to deterministic algorithms. 

Parameter Sweep Applications (PSAs) exploit that aim must involve making 
the problem very time consuming. However, since the instances of a PSA are 
independent, the distributed computing paradigm to to sample a large space of 
independent instances. In [23] , a grid-based version of a multi- volume stochastic 
simulator is presented. 

DiVinE [8] is a general distributed verification environment meant to support 
the development of distributed enumerative model checking algorithms. In [7], 
features about probabilistic analysis have been added to DiVinE. The model 
checker has been massively used for the analysis of biological systems, see, e.g., 
[6] 

StochKit [24] is an extensible stochastic simulation framework developed 
in the C++ language. It aims at making stochastic simulation accessible to 
biologists, while remaining open to extension via new stochastic and multi- 
scale algorithms. It implements the Gillespie algorithm, and other methods. 
It is a programming framework and in its second version it targets multi-core 
platforms, it is therefore similar to our work. It does not implement any kind of 
SIMD parallelism nor on-line trajectory reduction (that is performed as post- 
processing). 

7 Concluding remarks 

Starting from the Calculus of Wrapped Compartments we have discussed the 
main parallelisation issues for its simulator, and in general for the stochastic sim- 
ulation of biological systems, on commodity multi-core platforms. In particular, 
we distinguished two different approaches to parallelisation, i.e. the paralleli- 
sation of the single simulation instance and many simulation instances. For 
each class we have defined a number of design guidelines, which we believe, may 
support the easy and efficient porting of this class of algorithms on multi-cores. 
These guidelines include both the programming language abstractions (streams 
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and high-level programming patterns), the run-time mechanisms (intra-core 
SIMD parallelism, lock-free cache-friendly inter-core synchronisations here pro- 
vided by the FastFlow framework), and basic simulator architectural schema 
(simulation "objectification" , interleaved execution and pipelined reduction), 
which can be gracefully optimised with limited effort to experiment different 
parallel execution behaviours. 

The presented guidelines have been used to develop a multicore-aware port- 
ing of the CWC simulator, which have been experimented over classic simulation 
problems. The experimental evidence obtained in the design and utilisation of 
the parallel simulator are convincing both in term of the achieved performance 
and the moderate porting effort for the parallelisation of multiple instances 
whereas it appears disappointing for the parallelisation of the single instance. 

Both the FastFlow framework and the CWC simulator are open source soft- 
ware under LGPL licence [28, 29]. 
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